knitr::opts_chunk$set(warning=FALSE)
library(dplyr)
library(countrycode)
library(outliers)
library(caret)
library(cluster)
library(factoextra)
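library(NbClust)
library(rpart.plot)  # prp(), used to plot the rpart trees
library(RWeka)       # J48 (C4.5), used through caret
library(C50)         # C5.0()
library(DMwR)        # SMOTE(); archived on CRAN, may need installing from the archive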
Prediction of cyber security employees’ salaries based on 11 attributes
1. work_year
2. experience_level
3. employment_type
4. job_title
5. salary
6. salary_currency
7. salary_in_usd
8. employee_residence
9. remote_ratio
10. company_location
11. company_size
We are living in the “information age”, or rather the “data age”, in which everything around us revolves around data. Data has become one of the most valuable assets a person or an organisation can have, and losing it can cause significant damage. As a result, most attacks nowadays target data. To guard against such damage, organisations have realised the importance of protecting their digital assets, leading them to hire cybersecurity specialists. This has made cybersecurity popular, and there is a growing tendency to study it. Consequently, many professionals with varied experience levels and skills have entered the field, and organisations may find it difficult to decide on a salary for a job candidate based solely on a CV. Also, since attacks evolve rapidly, organisations will need to hire more employees in the future to defend against them, but predicting the future payroll is not easy, which may hinder some of an organisation’s plans. Another issue arises when an organisation’s decision makers are not fully aware of salary trends: this gives competitors a chance to attract their employees by offering salaries that better match current trends.
Prediction of cyber security employees’ salary categories (Very Low, Low, Medium, High, Very High) using classification methods.
Given the problems discussed above, and in order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions regarding the salary range of a cybersecurity employee, which can help organisations with the salary decisions described above.
https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries
dataset= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header=TRUE)
View(dataset)
We keep a copy of the original dataset before preprocessing, so that it can be used later if needed.
originalDataset= dataset
No. of attributes: 11
Type of attributes: Ordinal , Nominal, and Numeric
No. of objects: 1247
Class label: salary_in_usd
ncol(dataset)
nrow(dataset)
names(dataset)
str(dataset)
| Attribute Name | Description | Data Type | Possible values |
|---|---|---|---|
| work_year | The year in which salary was recorded | Numerical | 2020 to 2022 |
| experience_level | Expertise level of the employee | Ordinal | EN “Entry level”, MI “Mid level”, SE “Senior level”, EX “Executive level” |
| employment_type | The nature or category of the employee’s engagement in the job | Nominal | PT “Part time”, FT “Full time”, CT “Contract”, FL “Freelancer” |
| job_title | The role worked in during the year | Nominal | Various titles, e.g. Security Analyst, Security Researcher |
| salary | The total gross salary amount paid | Numerical | 1740 to 50001566 |
| salary_currency | The currency of the salary paid to the employee | Nominal | ISO 4217 currency codes, e.g. USD, EUR |
| salary_in_usd | The salary paid, in US dollars | Numerical | 2000 to 365596.40 |
| employee_residence | Employee’s primary country of residence | Nominal | Country codes, e.g. US, AE |
| remote_ratio | Percentage of online work by employee in the specified year | Numerical | 0 “No remote work”, 50 “Partially remote”, 100 “Fully remote” |
| company_location | The country of the employer’s main office | Nominal | Country codes, e.g. BR, BW |
| company_size | Size of the company | Ordinal | S, M or L |
We draw a random sample of 20 rows using the sample_n(table, size) function, after fixing the seed with set.seed() so that the sample is reproducible.
set.seed(30)
sample=sample_n(dataset,20)
print(sample)
is.na() returns FALSE for a cell that has a value and TRUE for a cell that is missing. In our dataset there are no missing values.
is.na(dataset)
sum(is.na(dataset))
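A per-column count is often easier to read than the full logical matrix. The sketch below illustrates the idea on a small made-up data frame (`toy` and its values are hypothetical, not the project data):

```r
# Hypothetical toy data with two missing entries
toy <- data.frame(salary = c(50000, NA, 72000),
                  level  = c("EN", "SE", NA))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Indices of rows that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```

Applied to the real data, `colSums(is.na(dataset))` would be expected to show zero for every column, in line with the total computed above.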
summary(dataset$work_year)
summary(dataset$salary)
summary(dataset$salary_in_usd)
summary(dataset$remote_ratio)
var(dataset$work_year)
var(dataset$salary)
var(dataset$salary_in_usd)
var(dataset$remote_ratio)
Here we use a boxplot to show the distribution of salary_in_usd for each experience_level. Salaries vary with the level of experience: higher experience levels tend to have higher salaries.
boxplot(salary_in_usd ~ experience_level, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we use a boxplot to show the distribution of salary_in_usd for each work_year. In 2021 salaries were relatively close to each other, whereas in 2022 the spread was wider.
boxplot(salary_in_usd ~ work_year, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we use a boxplot to show the distribution of salary_in_usd for each employment_type. Full-time (FT) positions tend to pay more than the other categories.
boxplot(salary_in_usd ~ employment_type, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we use a boxplot to show the distribution of salary_in_usd for each company_size. The larger the company, the higher the salary tends to be.
boxplot(salary_in_usd ~ company_size, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
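The four plots above repeat the same steps, so they could be wrapped in a small helper. The following is only a sketch: `salary_boxplot` is a hypothetical function name and `toy` is made-up data used to show the call.

```r
# Hypothetical helper: one boxplot of salary against a grouping column,
# with the y axis printed in plain (non-scientific) notation
salary_boxplot <- function(data, group_col, value_col = "salary_in_usd") {
  plot_formula <- reformulate(group_col, response = value_col)
  boxplot(plot_formula, data = data, yaxt = "n")
  ticks <- pretty(data[[value_col]])
  axis(side = 2, at = ticks, labels = format(ticks, scientific = FALSE))
  invisible(ticks)  # return the tick positions without printing them
}

# Made-up data, only to demonstrate the call
toy <- data.frame(salary_in_usd = c(40000, 60000, 90000, 150000, 200000, 120000),
                  company_size  = c("S", "S", "M", "M", "L", "L"))
ticks <- salary_boxplot(toy, "company_size")
```

With such a helper, each plot in this section would reduce to a single call, for example `salary_boxplot(dataset, "experience_level")`.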
The “salary” column gives the same information as “salary_in_usd”; the difference is only a matter of currency exchange, and we would eventually have to transform all the values in “salary” to one common currency to work with them properly. To further confirm that the two columns are redundant, we will use recent exchange rates between USD and each currency.
We start by creating a temporary column named “converted_salary”. It holds the salary obtained by converting salary_in_usd back into the original currency with the exchange rate, so that it can be compared with the “salary” column.
convertedDataset=dataset
convertedDataset$exchange_rate = factor(convertedDataset$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1/1,1/0.20,1/1.22,1/1.06,1/0.012,1/0.74,1/1.10,1/0.14,1/0.73,1/0.64,1/0.090,1/0.057,1/0.26,1/0.23,1/0.093,1/0.000065,1/0.60,1/0.0027,1/0.053,1/0.031,1/0.010))
convertedDataset$exchange_rate = as.numeric(as.character(convertedDataset$exchange_rate))
convertedDataset$converted_salary = convertedDataset$salary_in_usd*convertedDataset$exchange_rate
set.seed(1)
salary_sample <- sample_n(convertedDataset[,c("salary","converted_salary")],10)
print(salary_sample)
As shown in the sample, the two columns are almost identical. The correlation coefficient supports this as well.
correlation <- cor(convertedDataset$salary , convertedDataset$converted_salary)
print(correlation)
The correlation is very high but does not reach exactly 1, possibly because of rounding in the calculations and small changes in exchange rates over time.
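The conversion used for this check can also be written as a lookup in a named vector instead of relabelling a factor. The sketch below assumes illustrative rates for three currencies only (`usd_per_unit` and `toy` are hypothetical names and values):

```r
# Illustrative USD value of one unit of each currency (assumed figures)
usd_per_unit <- c(USD = 1, EUR = 1.06, GBP = 1.22)

# Made-up rows: currency code and salary already expressed in USD
toy <- data.frame(salary_currency = c("USD", "EUR", "GBP"),
                  salary_in_usd   = c(100000, 53000, 61000))

# Look up the rate for each row by currency code
rate <- unname(usd_per_unit[as.character(toy$salary_currency)])

# Convert the USD amount back to the local currency
toy$converted_salary <- toy$salary_in_usd / rate
print(toy)
```

Dividing by the USD value of one unit is equivalent to multiplying by its reciprocal, which is how the rates are stored in the code above.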
To make the mining process more efficient and improve its quality, we decided to remove the “salary” column.
dataset<-dataset[,-c(5)]
We will show outliers with boxplots and then remove them, to minimise noise and to get better analytical results when applying data mining techniques.
We start with the salary_in_usd attribute. The boxplot shows many outliers with exceptionally high values, so we remove the most extreme value using outlier().
boxplot(dataset$salary_in_usd)
# outlier() flags the value furthest from the mean; logical = TRUE returns a TRUE/FALSE mask
OutSalary = outlier(dataset$salary_in_usd, logical = TRUE)
Find_outlier = which(OutSalary)
# Drop the flagged row(s)
dataset = dataset[-Find_outlier, ]
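Because `outlier()` marks only the value furthest from the mean, a single call removes very few rows. If the aim is to drop every point that a boxplot would draw beyond its whiskers, one common alternative is the 1.5 × IQR rule. This is a sketch on made-up numbers, not the method used in this project:

```r
# Hypothetical salaries with two unusually high values
salaries <- c(40000, 52000, 61000, 58000, 75000, 66000, 300000, 420000)

q1  <- quantile(salaries, 0.25)
q3  <- quantile(salaries, 0.75)
iqr <- q3 - q1

# Flag values outside the usual 1.5 * IQR fences
is_outlier <- salaries < (q1 - 1.5 * iqr) | salaries > (q3 + 1.5 * iqr)

# Keep only the values inside the fences
cleaned <- salaries[!is_outlier]
print(cleaned)
```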
Next we check the remote_ratio attribute. The boxplot shows no outliers, so no rows need to be removed.
boxplot(dataset$remote_ratio)
Finally we check the work_year attribute. The boxplot shows no outliers here either, so no rows need to be removed.
boxplot(dataset$work_year)
The columns “company_location” and “employee_residence” hold the country of the company and of the employee, respectively. These attributes can be generalised to a higher-level concept, the region, which helps in understanding and analysing the dataset and can improve algorithm performance.
We will use the 7 regions as defined in the World Bank Development Indicators. These regions are:
East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.
Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.
Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.
Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.
North America: This is predominantly United States and Canada.
South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.
Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.
Note: UM (the United States Minor Outlying Islands) and AQ (Antarctica) do not belong to any of these regions, so they are kept as they are.
um=which(dataset$company_location=="UM")
aq=which(dataset$company_location=="AQ")
dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")
dataset[um,"company_location"]="UM"
dataset[aq,"company_location"]="AQ"
Concept hierarchy generation can be applied to “job_title” as well, to improve interpretability and scalability. Many job titles describe essentially the same job under different names, so we combine them into higher-level job titles such as Architect, Analyst and Engineer.
## Create the categories based on job rank
dataset$job_title <- ifelse(grepl("Analyst", dataset$job_title), "Analyst",
ifelse(grepl("Architect", dataset$job_title), "Architect",
ifelse(grepl("Engineer", dataset$job_title), "Engineer",
ifelse(grepl("Manager|Officer|Director|Leader", dataset$job_title), "Leadership",
ifelse(grepl("Consultant|Specialist", dataset$job_title), "Consultant/Specialist",
ifelse(grepl("Cyber", dataset$job_title), "Cyber Security",
"Others"))))))
To deal with character columns we encode them as factors, because most machine learning algorithms are designed to work with factors rather than character data. Encoding also improves performance and the interpretability of the data.
dataset$job_title <- factor(dataset$job_title)
dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))
dataset$employment_type <- factor(dataset$employment_type)
dataset$employee_residence <- factor(dataset$employee_residence)
dataset$company_location <- factor(dataset$company_location)
dataset$salary_currency <- factor(dataset$salary_currency)
dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))
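We discretise salary_in_usd into five categories, with breaks computed from its quantiles (the 0th, 25th, 50th, 75th, 95th and 100th percentiles).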
breaks <- quantile(dataset$salary_in_usd,
probs = c(0, .25, .5, .75, .95, 1),
na.rm = TRUE)
dataset$salary_in_usd <- cut(dataset$salary_in_usd,
breaks = breaks,
include.lowest = TRUE,
labels=c("Very Low", "Low", "Medium", "High", "Very High"))
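Because the last two breaks sit at the 75th and 95th percentiles, the resulting classes are not equally sized: the top class holds only about 5% of the rows. The sketch below shows this on simulated salaries (`toy_salary` is randomly generated, not the project data):

```r
set.seed(1)
# Simulated salaries, only to illustrate the binning
toy_salary <- round(runif(200, min = 20000, max = 250000))

# Same probabilities as in the code above
breaks <- quantile(toy_salary, probs = c(0, .25, .5, .75, .95, 1))

toy_class <- cut(toy_salary,
                 breaks = breaks,
                 include.lowest = TRUE,
                 labels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Class sizes: the last bin is much smaller than the others
print(table(toy_class))
```

This uneven class distribution is the kind of imbalance that an oversampling step is meant to address.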
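We standardise the numeric attributes (remote_ratio and work_year) with scale(), which centres each attribute to mean 0 and rescales it to standard deviation 1, so that they carry comparable weight. Note that this produces z-scores rather than values bounded in [-1, 1].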
dataset [, c("work_year" , "remote_ratio")] = scale(dataset [, c("work_year" , "remote_ratio")])
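To make the difference concrete, the sketch below compares z-score standardisation, which is what `scale()` does by default, with min-max rescaling to [-1, 1]. The input vector is a made-up example:

```r
# Made-up remote_ratio values
remote_ratio <- c(0, 50, 100, 100, 50)

# z-score standardisation: subtract the mean, divide by the standard deviation
z <- as.numeric(scale(remote_ratio))

# Min-max rescaling to [-1, 1], shown for comparison
rng <- range(remote_ratio)
mm  <- 2 * (remote_ratio - rng[1]) / (rng[2] - rng[1]) - 1

print(round(z, 3))
print(mm)
```

The z-scores have mean 0 and standard deviation 1 but no fixed bounds, whereas the min-max values always lie between -1 and 1.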
We will apply feature selection to remove redundant or irrelevant attributes from the dataset, aiming for the smallest subset that still gives accurate predictions of our target class (salary_in_usd) and reduces the time the classifier needs to process the data.
We will use RFE (recursive feature elimination), a wrapper method for feature selection. Since the rfe function has multiple control options, we need to specify the ones we want. We choose the random forest functions (rfFuncs) because random forests are generally accurate and can handle categorical data.
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
number = 10)
First we save the features to be used in the feature selection (every attribute except the class label “salary_in_usd”) in variable x, and the class label in variable y. Then we split the data into 80% training and 20% test.
x <- dataset %>%
select(-salary_in_usd) %>%
as.data.frame()
# Target variable
y <- dataset$salary_in_usd
# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[,1]
x_train <- x[ inTrain, ]
x_test <- x[-inTrain, ]
y_train <- y[ inTrain]
y_test <- y[-inTrain]
After splitting the data, we can perform the selection using rfe.
set.seed(1)
result_rfe1 <- rfe(x = x_train,
y = y_train,
sizes = c(1:9),
rfeControl = control)
result_rfe1
predictors(result_rfe1)
The results show that all the remaining attributes, except for “employment_type”, are selected. This is logical, as 98% of the rows have the value “FT”, as shown in the table below. Due to the low variance, we decided to remove this attribute.
table(dataset$employment_type)
dataset<-dataset[,-which(names(dataset)=="employment_type")]
dataset2= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header=TRUE)
char_vars <- sapply(dataset2, is.character)
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)
To address class imbalance in the dataset, we use the SMOTE() method, which oversamples the minority class by creating synthetic samples from existing minority-class samples.
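data_balanced <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under=500, k = 10)
balanced_dataset <- data_balanced  # the train() calls below refer to the balanced data by this name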
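The core idea behind SMOTE is interpolation: a synthetic sample is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours. The sketch below shows one such step for numeric features only, with made-up points; it is a simplified illustration, not the full algorithm used by `SMOTE()`:

```r
set.seed(42)
# Hypothetical minority-class points described by two numeric features
minority <- matrix(c(1.0, 2.0,
                     1.5, 1.8,
                     2.0, 2.2), ncol = 2, byrow = TRUE)

# Pick one point and find its nearest minority neighbour (Euclidean distance)
i <- 1
d <- sqrt(rowSums(sweep(minority, 2, minority[i, ])^2))
d[i] <- Inf                      # exclude the point itself
neighbour <- which.min(d)

# Synthetic sample: a random point on the segment between the two
gap <- runif(1)
synthetic <- minority[i, ] + gap * (minority[neighbour, ] - minority[i, ])
print(synthetic)
```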
The goal of all preceding steps is to properly prepare the dataset for the classification phase, which constitutes one of our primary mining objectives. In this section, we will employ various attribute selection methods such as the Gini index, Gain ratio, and information gain to construct a decision tree model. We will thoroughly evaluate its performance, and if it proves effective, it can subsequently be utilized to classify new instances with unknown class labels.
Since our dataset is small, we decided to use k-fold cross-validation. For each attribute selection method we will try three values of k (10, 5 and 3).
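In k-fold cross-validation every row is used for testing exactly once and for training k - 1 times. The sketch below shows how the fold assignment works, under the simplifying assumption of purely random folds on a hypothetical 20-row dataset (caret builds its own folds internally):

```r
set.seed(10)
n <- 20   # hypothetical number of rows
k <- 5    # number of folds

# Assign each row to one of k folds at random, keeping the fold sizes equal
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  test_rows  <- which(fold_id == fold)   # held out in this round
  train_rows <- which(fold_id != fold)   # used to fit the model
  cat("Fold", fold, ": train on", length(train_rows),
      "rows, test on", length(test_rows), "rows\n")
}
```

A smaller k leaves less data for training in each round, which is one reason the results may change with the number of folds.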
The following function will be used to compute the average sensitivity and specificity.
macro = function(matrix){
sumSen=0
for (i in 1:5) {
sumSen = sumSen + matrix$byClass[i,1]
}
avgSen = sumSen/5
sumSpec=0
for (i in 1:5) {
sumSpec = sumSpec + matrix$byClass[i,2]
}
avgSpec = sumSpec/5
avgs = data.frame(Sensitivity=avgSen , Specificity=avgSpec, Accuracy= unname( matrix$overall[1]) )
print(avgs)
}
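The averaging performed by this function is a macro average: sensitivity and specificity are computed per class and then averaged with equal weight. The sketch below works through the arithmetic in base R on a made-up 3-class confusion matrix (all counts are hypothetical):

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(8, 1, 1,
               2, 7, 1,
               0, 2, 8), nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("A", "B", "C"),
                             actual    = c("A", "B", "C")))

tp <- diag(cm)               # correctly classified, per class
fn <- colSums(cm) - tp       # actual members of the class that were missed
fp <- rowSums(cm) - tp       # other classes wrongly assigned to the class
tn <- sum(cm) - tp - fn - fp

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Macro average: unweighted mean across classes
macro_avg <- c(Sensitivity = mean(sensitivity),
               Specificity = mean(specificity),
               Accuracy    = sum(tp) / sum(cm))
print(round(macro_avg, 4))
```

If a class is never predicted or never observed, its ratio is undefined, so the average over classes is undefined as well.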
The Gini index measures the impurity of the dataset. The partitioning that yields the largest reduction in impurity is selected as the splitting attribute. To apply the Gini index, we will use the “rpart” method, which uses the Gini index as its splitting criterion.
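As a worked illustration, the sketch below computes the Gini index of a small made-up node and the reduction in impurity obtained by splitting it on one attribute. The labels and attribute values are hypothetical:

```r
# Gini index of a vector of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Hypothetical node with 4 "Low" and 4 "High" tuples, and one candidate attribute
class_label <- c("Low", "Low", "Low", "Low", "High", "High", "High", "High")
experience  <- c("EN",  "EN",  "EN",  "SE",  "EN",   "SE",   "SE",   "SE")

gini_parent <- gini(class_label)

# Weighted impurity of the partitions produced by splitting on the attribute
parts <- split(class_label, experience)
gini_split <- sum(sapply(parts, function(x) length(x) / length(class_label) * gini(x)))

reduction <- gini_parent - gini_split
print(c(parent = gini_parent, split = gini_split, reduction = reduction))
```

Among several candidate attributes, the one with the largest reduction would be chosen for the split.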
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
giniIndex10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart",trControl = ctrl)
prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
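# Note: confusionMatrix(data, reference) expects the predictions first. Here and in the calls that follow, the observed classes are passed first, so the rows labelled "Prediction" in the output are observed classes and the columns labelled "Reference" are predicted classes.
caret::confusionMatrix(giniIndex10$pred$obs,giniIndex10$pred$pred)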
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 0 29 122 18 5
Low 0 133 67 1 69
Medium 0 71 134 9 23
Very_High 0 5 98 146 3
Very_Low 0 93 52 6 113
Overall Statistics
Accuracy : 0.4394
95% CI : (0.4111, 0.4681)
No Information Rate : 0.3952
P-Value [Acc > NIR] : 0.001007
Kappa : 0.2891
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity NA 0.4018 0.2833 0.8111
Specificity 0.8546 0.8418 0.8577 0.8958
Pos Pred Value NA 0.4926 0.5654 0.5794
Neg Pred Value NA 0.7864 0.6469 0.9640
Prevalence 0.0000 0.2765 0.3952 0.1504
Detection Rate 0.0000 0.1111 0.1119 0.1220
Detection Prevalence 0.1454 0.2256 0.1980 0.2105
Balanced Accuracy NA 0.6218 0.5705 0.8534
Class: Very_Low
Sensitivity 0.5305
Specificity 0.8465
Pos Pred Value 0.4280
Neg Pred Value 0.8928
Prevalence 0.1779
Detection Rate 0.0944
Detection Prevalence 0.2206
Balanced Accuracy 0.6885
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
giniIndex5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart",trControl = ctrl)
prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
caret::confusionMatrix(giniIndex5$pred$obs,giniIndex5$pred$pred)
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 0 23 99 41 11
Low 0 118 51 17 84
Medium 0 66 107 36 28
Very_High 0 4 82 162 4
Very_Low 0 85 39 19 121
Overall Statistics
Accuracy : 0.4244
95% CI : (0.3962, 0.453)
No Information Rate : 0.3158
P-Value [Acc > NIR] : 1.978e-15
Kappa : 0.2692
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity NA 0.39865 0.28307 0.5891
Specificity 0.8546 0.83130 0.84127 0.9024
Pos Pred Value NA 0.43704 0.45148 0.6429
Neg Pred Value NA 0.80798 0.71771 0.8804
Prevalence 0.0000 0.24728 0.31579 0.2297
Detection Rate 0.0000 0.09858 0.08939 0.1353
Detection Prevalence 0.1454 0.22556 0.19799 0.2105
Balanced Accuracy NA 0.61497 0.56217 0.7457
Class: Very_Low
Sensitivity 0.4879
Specificity 0.8493
Pos Pred Value 0.4583
Neg Pred Value 0.8639
Prevalence 0.2072
Detection Rate 0.1011
Detection Prevalence 0.2206
Balanced Accuracy 0.6686
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
giniIndex3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart",trControl = ctrl)
prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
caret::confusionMatrix(giniIndex3$pred$obs,giniIndex3$pred$pred)
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 0 30 76 62 6
Low 0 161 33 28 48
Medium 0 86 80 60 11
Very_High 0 6 59 176 11
Very_Low 0 103 20 27 114
Overall Statistics
Accuracy : 0.4436
95% CI : (0.4152, 0.4723)
No Information Rate : 0.3225
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.292
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity NA 0.4171 0.29851 0.4986
Specificity 0.8546 0.8656 0.83100 0.9100
Pos Pred Value NA 0.5963 0.33755 0.6984
Neg Pred Value NA 0.7573 0.80417 0.8127
Prevalence 0.0000 0.3225 0.22389 0.2949
Detection Rate 0.0000 0.1345 0.06683 0.1470
Detection Prevalence 0.1454 0.2256 0.19799 0.2105
Balanced Accuracy NA 0.6413 0.56475 0.7043
Class: Very_Low
Sensitivity 0.60000
Specificity 0.85104
Pos Pred Value 0.43182
Neg Pred Value 0.91854
Prevalence 0.15873
Detection Rate 0.09524
Detection Prevalence 0.22055
Balanced Accuracy 0.72552
The gain ratio, a normalized measure of information gain, is calculated by dividing information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.
J48 is the open-source Java implementation of the C4.5 algorithm included in the Weka package. It allows users to conveniently apply the C4.5 decision tree.
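The definition above can be checked on a small example. The sketch below computes entropy, information gain, split information and the gain ratio for one candidate attribute; the labels and attribute values are made up, and the helper names are hypothetical:

```r
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain ratio = information gain / split information
gain_ratio <- function(class_label, attribute) {
  parts      <- split(class_label, attribute)
  weights    <- sapply(parts, length) / length(class_label)
  info_after <- sum(weights * sapply(parts, entropy))
  info_gain  <- entropy(class_label) - info_after
  split_info <- -sum(weights * log2(weights))
  info_gain / split_info
}

# Hypothetical node with 4 "Low" and 4 "High" tuples
class_label <- c("Low", "Low", "Low", "Low", "High", "High", "High", "High")
experience  <- c("EN",  "EN",  "EN",  "SE",  "EN",   "SE",   "SE",   "SE")

gr <- gain_ratio(class_label, experience)
print(gr)
```

Here the attribute splits the node into two equal halves, so the split information is 1 bit and the gain ratio equals the information gain.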
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
gainRatio10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio10$finalModel)
gainRatio10cm = caret::confusionMatrix(gainRatio10$pred$obs, gainRatio10$pred$pred)
gainRatio10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 100 18 32 21 3
Low 28 148 40 3 51
Medium 53 49 115 14 6
Very_High 28 6 18 194 6
Very_Low 2 39 11 5 207
Overall Statistics
Accuracy : 0.6383
95% CI : (0.6103, 0.6655)
No Information Rate : 0.2281
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5465
Mcnemar's Test P-Value : 0.167
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity 0.47393 0.5692 0.53241 0.8186
Specificity 0.92495 0.8698 0.87564 0.9396
Pos Pred Value 0.57471 0.5481 0.48523 0.7698
Neg Pred Value 0.89150 0.8792 0.89479 0.9545
Prevalence 0.17627 0.2172 0.18045 0.1980
Detection Rate 0.08354 0.1236 0.09607 0.1621
Detection Prevalence 0.14536 0.2256 0.19799 0.2105
Balanced Accuracy 0.69944 0.7195 0.70402 0.8791
Class: Very_Low
Sensitivity 0.7582
Specificity 0.9383
Pos Pred Value 0.7841
Neg Pred Value 0.9293
Prevalence 0.2281
Detection Rate 0.1729
Detection Prevalence 0.2206
Balanced Accuracy 0.8483
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
gainRatio5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio5$finalModel)
gainRatio5cm=caret::confusionMatrix(gainRatio5$pred$obs, gainRatio5$pred$pred)
gainRatio5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 102 21 34 16 1
Low 31 148 38 1 52
Medium 56 46 103 15 17
Very_High 30 7 18 194 3
Very_Low 3 42 10 9 200
Overall Statistics
Accuracy : 0.6241
95% CI : (0.5959, 0.6516)
No Information Rate : 0.2281
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5289
Mcnemar's Test P-Value : 0.007667
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity 0.45946 0.5606 0.50739 0.8255
Specificity 0.92615 0.8692 0.86519 0.9397
Pos Pred Value 0.58621 0.5481 0.43460 0.7698
Neg Pred Value 0.88270 0.8749 0.89583 0.9566
Prevalence 0.18546 0.2206 0.16959 0.1963
Detection Rate 0.08521 0.1236 0.08605 0.1621
Detection Prevalence 0.14536 0.2256 0.19799 0.2105
Balanced Accuracy 0.69281 0.7149 0.68629 0.8826
Class: Very_Low
Sensitivity 0.7326
Specificity 0.9307
Pos Pred Value 0.7576
Neg Pred Value 0.9218
Prevalence 0.2281
Detection Rate 0.1671
Detection Prevalence 0.2206
Balanced Accuracy 0.8317
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
gainRatio3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio3$finalModel)
gainRatio3cm=caret::confusionMatrix(gainRatio3$pred$obs, gainRatio3$pred$pred)
gainRatio3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 94 18 39 19 4
Low 25 129 39 3 74
Medium 47 47 110 19 14
Very_High 14 4 27 200 7
Very_Low 3 32 9 12 208
Overall Statistics
Accuracy : 0.619
95% CI : (0.5909, 0.6467)
No Information Rate : 0.2565
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5216
Mcnemar's Test P-Value : 0.007322
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity 0.51366 0.5609 0.4911 0.7905
Specificity 0.92110 0.8542 0.8695 0.9449
Pos Pred Value 0.54023 0.4778 0.4641 0.7937
Neg Pred Value 0.91300 0.8910 0.8812 0.9439
Prevalence 0.15288 0.1921 0.1871 0.2114
Detection Rate 0.07853 0.1078 0.0919 0.1671
Detection Prevalence 0.14536 0.2256 0.1980 0.2105
Balanced Accuracy 0.71738 0.7075 0.6803 0.8677
Class: Very_Low
Sensitivity 0.6775
Specificity 0.9371
Pos Pred Value 0.7879
Neg Pred Value 0.8939
Prevalence 0.2565
Detection Rate 0.1738
Detection Prevalence 0.2206
Balanced Accuracy 0.8073
All three trees appear to have the same structure.
The first attribute selected, at the root node, is the experience level, which divides the tree into:
Right subtree: SE (Senior level) and EX (Executive level)
Left subtree: EN (Entry level) and MI (Mid level)
Each of these subtrees further refines the classification based on the attribute “employee residence”. However, the splitting criteria differ between the right and left subtrees:
In the right subtree:
The split is based on whether the tuple has the value “Latin America & Caribbean”.
In the left subtree:
If the experience level is 1, the tree further partitions based on whether the tuple has the value “North America”. If the experience level is 2, the split is based on “employee residence” being “Latin America & Caribbean”.
rbind("10 Folds"=macro(gainRatio10cm), "5 Folds"=macro(gainRatio5cm), "3 Folds"=macro(gainRatio3cm) )
Based on the evaluation metrics of Sensitivity, Specificity, and Accuracy, it is evident that the gain ratio model, built using a 10-fold cross-validation approach, exhibits superior performance compared to the other two models. However, it’s worth noting that the difference in performance between the models is relatively small. Notably, as the number of folds decreases, a corresponding decline in the model’s performance becomes apparent.
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
infoGain10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 8 for this object. Predictions generated using 8 trials
(this warning is repeated many times with different trial limits)
c5model <- C5.0(salary_in_usd ~ .,
data = data_balanced,
trials = infoGain10$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)
caret::confusionMatrix(infoGain10$pred$obs, infoGain10$pred$pred)
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 90 19 40 20 5
Low 24 127 42 10 67
Medium 51 58 96 19 13
Very_High 17 3 15 208 9
Very_Low 3 43 4 10 204
Overall Statistics
Accuracy : 0.6057
95% CI : (0.5773, 0.6335)
No Information Rate : 0.249
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.5046
Mcnemar's Test P-Value : 0.03427
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity 0.48649 0.5080 0.4873 0.7790
Specificity 0.91700 0.8490 0.8590 0.9527
Pos Pred Value 0.51724 0.4704 0.4051 0.8254
Neg Pred Value 0.90714 0.8673 0.8948 0.9376
Prevalence 0.15455 0.2089 0.1646 0.2231
Detection Rate 0.07519 0.1061 0.0802 0.1738
Detection Prevalence 0.14536 0.2256 0.1980 0.2105
Balanced Accuracy 0.70174 0.6785 0.6732 0.8659
Class: Very_Low
Sensitivity 0.6846
Specificity 0.9333
Pos Pred Value 0.7727
Neg Pred Value 0.8992
Prevalence 0.2490
Detection Rate 0.1704
Detection Prevalence 0.2206
Balanced Accuracy 0.8089
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
infoGain5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 7 for this object. Predictions generated using 7 trials
(this warning is repeated many times with different trial limits)
c5model <- C5.0(salary_in_usd ~ .,
data = data_balanced,
trials = infoGain5$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)
caret::confusionMatrix(infoGain5$pred$obs, infoGain5$pred$pred)
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 85 20 47 16 6
Low 25 129 38 10 68
Medium 37 70 99 14 17
Very_High 16 4 13 210 9
Very_Low 2 37 9 13 203
Overall Statistics
Accuracy : 0.6065
95% CI : (0.5782, 0.6343)
No Information Rate : 0.2531
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5049
Mcnemar's Test P-Value : 0.001691
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity 0.51515 0.4962 0.48058 0.7985
Specificity 0.91376 0.8495 0.86075 0.9550
Pos Pred Value 0.48851 0.4778 0.41772 0.8333
Neg Pred Value 0.92180 0.8587 0.88854 0.9439
Prevalence 0.13784 0.2172 0.17210 0.2197
Detection Rate 0.07101 0.1078 0.08271 0.1754
Detection Prevalence 0.14536 0.2256 0.19799 0.2105
Balanced Accuracy 0.71446 0.6728 0.67066 0.8768
Class: Very_Low
Sensitivity 0.6700
Specificity 0.9318
Pos Pred Value 0.7689
Neg Pred Value 0.8928
Prevalence 0.2531
Detection Rate 0.1696
Detection Prevalence 0.2206
Balanced Accuracy 0.8009
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
infoGain3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 4 for this object. Predictions generated using 4 trials
(this warning is repeated many times with different trial limits)
c5model <- C5.0(salary_in_usd ~ .,
data = data_balanced,
trials = infoGain3$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)
caret::confusionMatrix(infoGain3$pred$obs, infoGain3$pred$pred)
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 85 25 40 19 5
Low 19 137 37 5 72
Medium 55 64 93 11 14
Very_High 16 8 22 197 9
Very_Low 4 43 11 9 197
Overall Statistics
Accuracy : 0.5923
95% CI : (0.5639, 0.6203)
No Information Rate : 0.2481
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.4874
Mcnemar's Test P-Value : 0.01149
Statistics by Class:
Class: High Class: Low Class: Medium Class: Very_High
Sensitivity 0.47486 0.4946 0.45813 0.8174
Specificity 0.91257 0.8554 0.85513 0.9425
Pos Pred Value 0.48851 0.5074 0.39241 0.7817
Neg Pred Value 0.90811 0.8490 0.88542 0.9534
Prevalence 0.14954 0.2314 0.16959 0.2013
Detection Rate 0.07101 0.1145 0.07769 0.1646
Detection Prevalence 0.14536 0.2256 0.19799 0.2105
Balanced Accuracy 0.69372 0.6750 0.65663 0.8799
Class: Very_Low
Sensitivity 0.6633
Specificity 0.9256
Pos Pred Value 0.7462
Neg Pred Value 0.8928
Prevalence 0.2481
Detection Rate 0.1646
Detection Prevalence 0.2206
Balanced Accuracy 0.7944
#a)fviz_nbclust() with silhouette method using library(factoextra)
fviz_nbclust(dataset, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")